
(ICCV 2017) Learning feature pyramids for human pose estimation

Yang W, Li S, Ouyang W, et al. Learning Feature Pyramids for Human Pose Estimation. In: IEEE International Conference on Computer Vision (ICCV), 2017.



1. Overview


1.1. Motivation

  • pyramid methods are widely used at inference time
  • learning feature pyramids inside DCNNs is still not well explored
  • existing weight initialization schemes (MSRA, Xavier) are not suitable for layers with multiple input branches


To address these issues, this paper proposes the Pyramid Residual Module (PRM) and studies:

  • subsampling ratios in a multi-branch network
  • weight initialization schemes

1.2. Contribution

  • the Pyramid Residual Module (PRM)
  • a weight initialization scheme for multi-branch networks
  • the observation that the activation variance accumulation introduced by identity mappings may be harmful in some scenarios

1.3. Related Work

1.3.1. Human Pose Estimation

  • graph structures built on handcrafted features, e.g., pictorial structures and loopy structures
  • direct regression of joint locations
  • score maps with Gaussian peaks at the joint locations
  • image pyramids at test time, at the cost of extra computation and memory

1.3.2. Structures of DCNNs

  • plain networks, e.g., AlexNet, VGG
  • multi-branch networks, e.g., Inception, ResNet, ResNeXt

1.3.3. Weight Initialization

  • layer-by-layer pretraining strategy
  • Gaussian initialization with μ=0, σ=0.01
  • Xavier: a sound estimate of the weight variance; assumes weights are initialized close to zero, so nonlinear activations (Sigmoid, Tanh) can be regarded as linear functions
  • MSRA: an initialization scheme for rectifier (ReLU) networks
  • all of the above are derived for plain networks with only one branch (reference forms below)
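
For reference, the standard single-branch forward-propagation forms that Section 3 generalizes (Xavier is also often written with the averaged fan-in/fan-out):

$$\text{Xavier: } \operatorname{Var}[w] = \frac{1}{n_{\text{in}}}, \qquad \text{MSRA: } \operatorname{Var}[w] = \frac{2}{n_{\text{in}}}$$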

1.4. Dataset

  • MPII
  • LSP



2. Framework




2.1. Pyramid Residual Modules



  • PRM-B achieves performance comparable to the other variants
  • features at smaller resolutions contain relatively less information, so fewer feature channels are used for branches at smaller scales


  • C: the number of pyramid levels
  • f_c: the transformation for the c-th pyramid level, designed as a bottleneck structure (see the sketch below)
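
A minimal PyTorch sketch of the idea, assuming a PRM-B-style module (shared 1×1 reduction across pyramid levels). The channel widths, the equal width per branch, and the use of bilinear interpolation in place of fractional max-pooling (see 2.2) are simplifications, not the authors' exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PRM(nn.Module):
    """Sketch of a Pyramid Residual Module (PRM-B style)."""

    def __init__(self, in_ch, out_ch, C=4, M=1):
        super().__init__()
        mid = out_ch // 2  # bottleneck width (assumed)
        # shared 1x1 reduction for all pyramid levels (PRM-B style)
        self.reduce = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        # subsampling ratio per level, s_c in [2^-M, 1] (assumed schedule)
        self.ratios = [2 ** (-M * c / C) for c in range(1, C + 1)]
        # one 3x3 conv per level; the paper uses fewer channels at
        # smaller scales, equal widths are used here to allow a simple sum
        self.pyramid = nn.ModuleList(
            nn.Conv2d(mid, mid, 3, padding=1, bias=False)
            for _ in self.ratios)
        self.expand = nn.Conv2d(mid, out_ch, 1, bias=False)
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, 1, bias=False))

    def forward(self, x):
        h, w = x.shape[-2:]
        z = self.reduce(x)
        acc = 0
        for s, conv in zip(self.ratios, self.pyramid):
            # subsample to ratio s_c (bilinear stands in for fractional
            # max-pooling), transform, upsample back, and sum the levels
            y = F.interpolate(z, scale_factor=s, mode='bilinear',
                              align_corners=False)
            acc = acc + F.interpolate(conv(y), size=(h, w),
                                      mode='bilinear', align_corners=False)
        return self.skip(x) + self.expand(F.relu(acc))
```

For example, PRM(256, 256)(torch.randn(1, 256, 64, 64)) preserves the 64×64 resolution while mixing features across four scales.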

2.2. Fractional Max-Pooling

  • standard pooling reduces the resolution too quickly to build a pyramid with enough levels
  • fractional max-pooling is used to approximate the smoothing and subsampling process that generates an image pyramid


  • s_c: the subsampling ratio of the c-th pyramid level, s_c ∈ [2^{-M}, 1]
  • M = 1 and C = 4 in the experiments (see the sketch below)
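
A small sketch of how these ratios could be realized with PyTorch's built-in fractional max-pooling; the evenly-log-spaced schedule s_c = 2^{-Mc/C} is an assumption read off the bullets above:

```python
import torch
import torch.nn as nn

M, C = 1, 4
ratios = [2 ** (-M * c / C) for c in range(1, C + 1)]  # ~0.84, 0.71, 0.59, 0.5

x = torch.randn(1, 16, 64, 64)
for s in ratios:
    # fractional max-pooling subsamples to a non-integer fraction of
    # the input resolution, here 64 -> floor(64 * s)
    pool = nn.FractionalMaxPool2d(kernel_size=3, output_ratio=s)
    print(f"s_c = {s:.3f} -> {tuple(pool(x).shape[-2:])}")
```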



3. Training and Inference


BN is less effective here because the large memory consumption of the network forces a small mini-batch size.

3.1. Loss Function



  • for the k-th body joint z_k = (x_k, y_k), the ground-truth score map S_k is generated from a Gaussian with mean z_k and covariance Σ
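
Written out, a plausible form of the score map and the squared-error training loss (the unnormalized Gaussian and the exact normalization of the loss are assumptions):

$$\mathbf{S}_k(\mathbf{p}) = \exp\!\Big(-\tfrac{1}{2}\,(\mathbf{p}-\mathbf{z}_k)^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{p}-\mathbf{z}_k)\Big), \qquad \mathcal{L} = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\big\lVert \hat{\mathbf{S}}_k^{(n)} - \mathbf{S}_k^{(n)} \big\rVert^{2}$$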

3.2. Forward Propagation

(Assumption: μ = 0.)
To make the variances of the outputs y_l approximately the same across layers, the following condition must be satisfied:
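
Reconstructed from the definitions below, the condition plausibly reads (treat the exact indexing as an assumption):

$$\alpha \sum_{c=1}^{C_i} n_c \operatorname{Var}\!\big[w_c^{(l)}\big] = 1$$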



  • at initialization, a proper variance for W_l should therefore be
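
A reconstruction consistent with the condition above, assigning each of the C_i input branches an equal share:

$$\operatorname{Var}\!\big[w_c^{(l)}\big] = \frac{1}{\alpha\, C_i\, n_c}$$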



  • α: depends on the activation function; 0.5 for ReLU, 1 for Tanh and Sigmoid

  • C_i: the number of input branches of the l-th layer
  • n_c: the number of elements in x_c, c = 1, …, C_i

3.3. Backward Propagation



  • C_o: the number of output branches
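
By symmetry with the forward case, the backward condition plausibly reads (with n̂ the number of elements per output branch, a notation assumed here):

$$\alpha\, C_o\, \hat{n}^{(l)} \operatorname{Var}\!\big[w^{(l)}\big] = 1 \quad\Longrightarrow\quad \operatorname{Var}\!\big[w^{(l)}\big] = \frac{1}{\alpha\, C_o\, \hat{n}^{(l)}}$$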

Special case: C_o = C_i = 1 recovers the initialization schemes for plain networks.

3.4. Output Accumulation

Drawback: the identity mapping keeps increasing the variance of the responses as the network goes deeper, which increases the difficulty of optimization.
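
Concretely, if the identity branch and the residual branch are assumed independent, the variances add at every residual unit, so they grow with depth:

$$\operatorname{Var}\!\big[\mathbf{x}^{(l+1)}\big] = \operatorname{Var}\!\big[\mathbf{x}^{(l)}\big] + \operatorname{Var}\!\big[\mathcal{F}(\mathbf{x}^{(l)};\mathbf{W}^{(l)})\big]$$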



  • whenever the identity mapping is replaced by a Conv layer to reduce or increase the number of channels, the variance is reset to a small value


  • a 1×1 Conv-BN-ReLU block is used to replace the identity mapping, which stops the variance explosion (see the sketch below)
  • breaking the variance explosion is found to give better performance
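
A minimal sketch of such a shortcut, assuming the stated Conv-BN-ReLU order; the BN is what re-normalizes the accumulated statistics:

```python
import torch.nn as nn

def variance_reset_skip(in_ch, out_ch):
    # replaces the identity shortcut when the channel count changes;
    # BatchNorm re-normalizes the responses, stopping the variance
    # from accumulating across residual units
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))
```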





4. Experiments


4.1. Details

  • input cropped and resized to 256×256
  • augmentation: scaling, rotation, flipping, and adding color noise
  • mini-batch size 16
  • testing: six-scale image pyramid with flipping (see the sketch below)
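
A hedged sketch of what six-scale flip testing could look like; the scale values, the output stride of 4, and the omitted left-right joint-pair swap after flipping are all assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multiscale_flip_predict(model, img,
                            scales=(0.7, 0.8, 0.9, 1.0, 1.1, 1.2)):
    """Average score maps over an image pyramid and horizontal flips."""
    h, w = img.shape[-2:]
    acc = 0
    for s in scales:
        x = F.interpolate(img, scale_factor=s, mode='bilinear',
                          align_corners=False)
        for flip in (False, True):
            xi = torch.flip(x, dims=[-1]) if flip else x
            y = model(xi)
            if flip:
                y = torch.flip(y, dims=[-1])
            # resize each prediction to a common heatmap resolution
            # (stride-4 output assumed) before averaging
            acc = acc + F.interpolate(y, size=(h // 4, w // 4),
                                      mode='bilinear', align_corners=False)
    return acc / (2 * len(scales))
```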

4.2. Comparison



4.3. Ablation Study

4.3.1. Variant



4.3.2. Scale of Pyramid and Weight Initialization



4.3.3. Controlling Variance Explosion

  • with variance control: 88.5 vs. without: 88.0 vs. baseline: 87.6